Pierre Mulliez
‘As part of the course “Programming in R” at IE - HST, we foster our newly learned skills by participating in a Kaggle competition. The problem requires us to do Exploratory Data Analysis, Data Cleaning and Manipulation and implement some sort of Machine Learning in R.’
The data can be found here
To carry out this task, we first take a closer look at the available data to determine which transformations are needed before moving on to the data processing stage and then building the predictive model.
We designed several plots to understand the data and to decide which variables to include in our prediction. First, we took a closer look at how the stations' patterns vary with the date:
We merged the additional station information file with the per-station summary into a data table, producing a readable comparison of each station's average across all dates with its altitude and coordinates.
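The merge described above could look roughly like the following. This is only a minimal sketch on toy data: the column names (`Date`, `stid`, `nlat`, `elon`, `elev`), the station identifiers and all the values are assumptions made for illustration, not the actual contents of the competition files.

```r
library(data.table)

# Toy stand-ins for the real files; every name and value here is an assumption.
solar_data <- data.table(
  Date = c(19940101, 19940102, 19940103),
  ACME = c(12000000, 11000000, 13000000),
  ADAX = c(10000000, 12000000, 14000000)
)
station_info <- data.table(
  stid = c("ACME", "ADAX"),
  nlat = c(34.81, 34.80),
  elon = c(-98.02, -96.67),
  elev = c(397, 295)
)

# Average of each station across all dates (one row per station).
station_cols <- setdiff(names(solar_data), "Date")
station_avg <- data.table(
  stid = station_cols,
  avg_production = vapply(
    station_cols,
    function(s) mean(solar_data[[s]]),
    numeric(1),
    USE.NAMES = FALSE
  )
)

# Attach altitude and coordinates to each station's average.
station_summary <- merge(station_avg, station_info, by = "stid")
print(station_summary)
```

Keeping one row per station makes it straightforward to plot the average against altitude or to place it on a map by its coordinates.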
Following the analysis of the stations' relative observations by location, we thought we could establish some segmentation:
n <- nrow(solar_data)
# 70% of the rows for training, 15% for validation, the remainder for testing
train_index <- sample(seq_len(n), floor(0.7 * n))
val_index <- sample(setdiff(seq_len(n), train_index), floor(0.15 * n))
test_index <- setdiff(seq_len(n), c(train_index, val_index))
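To make the split concrete, here is a small sketch of how the three index vectors might be used to subset the data and how one could check that they do not overlap. The toy `solar_data` table, its columns and the fixed seed are assumptions added for reproducibility of the example; they are not part of the original analysis.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

# Toy stand-in for solar_data (100 rows); the real table is not reproduced here.
solar_data <- data.frame(Date = 1:100, value = rnorm(100))

n <- nrow(solar_data)
train_index <- sample(seq_len(n), floor(0.7 * n))
val_index <- sample(setdiff(seq_len(n), train_index), floor(0.15 * n))
test_index <- setdiff(seq_len(n), c(train_index, val_index))

# Subset the rows into the three partitions.
train <- solar_data[train_index, ]
val <- solar_data[val_index, ]
test <- solar_data[test_index, ]

# The three index sets should be disjoint and together cover every row.
all_index <- c(train_index, val_index, test_index)
stopifnot(!anyDuplicated(all_index), length(all_index) == n)

# Number of rows in each partition.
split_sizes <- c(train = nrow(train), val = nrow(val), test = nrow(test))
print(split_sizes)
```

Drawing the validation rows from `setdiff(seq_len(n), train_index)` is what keeps them out of the training set, and the test set is simply whatever remains.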
While looking for our top-performing model, we considered a variety of algorithms that could be used for this purpose, including xgboost and svm. After testing multiple models, we came to the conclusion that X model [TBA based on Max’s results] was best suited to the purposes of this assignment, so we showcase it below.